Abstract

Objective: Heart failure (HF) is an important covariate and outcome in studies of elderly populations and cardiovascular
disease cohorts, among others. Administrative data is increasingly being used for long-term clinical research in these 
populations. We aimed to conduct the first systematic review and meta-analysis of studies reporting on the validity of 
diagnostic codes for identifying HF in administrative data. 

Methods: MEDLINE and EMBASE were searched (inception to November 2010) for studies: (a) Using administrative data to
identify HF; or (b) Evaluating the validity of HF codes in administrative data; and (c) Reporting validation statistics 
(sensitivity, specificity, positive predictive value [PPV], negative predictive value, or Kappa scores) for HF, or data sufficient 
for their calculation. Additional articles were located by hand search (up to February 2011) of original papers. Data were 
extracted by two independent reviewers; article quality was assessed using the Quality Assessment of Diagnostic Accuracy 
Studies tool. Using a random-effects model, pooled sensitivity and specificity values were produced, along with estimates of 
the positive (LR+) and negative (LR-) likelihood ratios, and diagnostic odds ratios (DOR = LR+/LR-) of HF codes.

Results: Nineteen studies published from 1999-2009 were included in the qualitative review. Specificity was ≥95% in all
studies and PPV was ≥87% in the majority, but sensitivity was lower (≥69% in ≥50% of studies). In a meta-analysis of the 11
studies reporting sensitivity and specificity values, the pooled sensitivity was 75.3% (95% CI: 74.7-75.9) and specificity was
96.8% (95% CI: 96.8-96.9). The pooled LR+ was 51.9 (20.5-131.6), the LR- was 0.27 (0.20-0.37), and the DOR was 186.5
(96.8-359.2).

Conclusions: While most HF diagnoses in administrative databases do correspond to true HF cases, about one-quarter of HF
cases are not captured. The use of broader search parameters, along with laboratory and prescription medication data, may 
help identify more cases. 
Introduction

Heart failure (HF) is a chronic condition that affects about 26 
million people worldwide [1] and imposes a tremendous burden
on these individuals and their families. The typical 40 year-old 
faces a 20% lifetime risk of developing HF [2], and the incidence
of HF amongst adults 65 years of age and older is approximately 
12.5 per 1,000 person-years [3]. About half of new cases are 
expected to die within five years of diagnosis [2], and estimates of 
the annual economic burden of HF have recently exceeded $30
billion in the United States [2], and $108 billion worldwide [4].



The European Society of Cardiology describes HF as a disorder 
of cardiac structure or function where the heart is unable to deliver 
adequate levels of oxygen to the tissues [5]. Cases often have 
primary left systolic HF, which is characterized by "reduced 
contraction and emptying of the left ventricle" [5]. Still, many 
cases have left diastolic HF, in which ventricular compliance and 
filling are impaired [6] but the contractile function of the ventricle 
is preserved. HF has some 'classic' signs and symptoms, including 
ankle oedema, and exertional dyspnoea and fatigue [5,6]. 
However, HF is not considered to be a discrete condition but a 
"complex clinical syndrome" [6] that occurs in conjunction with 
other cardiovascular diseases such as coronary artery disease,
valvular heart disease, hypertension, dilated cardiomyopathy [6], 
and conduction and rhythm disorders [5]. A significant source of
morbidity on its own, HF frequently occurs in concordance with 
other chronic disorders such as renal disease [7-9], chronic 
obstructive pulmonary disease (COPD) [8,10-12], and diabetes 
[7-11]. Thus when evaluating treatments for these and other 
chronic conditions, it is essential to adjust for diagnoses of HF. 

Administrative databases have become excellent resources for 
the study of HF by allowing for long-term evaluation of large 
numbers of patients at relatively low cost. Some examples are the 
Medicare databases in the United States (USA) and health 
ministry databases from countries such as Canada where 
healthcare is funded by provincial governments and available to 
all residents. These data sources allow the patient-level linkage of 
health resource utilization data (including hospital separations, 
outpatient encounters, and sometimes, dispensed prescriptions) to 
demographic and vital statistics data. When studying clinic-based 
populations, patients with severe HF are likely to be overrepresented,
but administrative databases provide a means for
identifying risk factors for HF, and quantifying the effects of 
treatment in unselected populations. 

However, administrative databases are only useful for HF 
research if the diagnostic codes contained within are valid; that is, 
if they can be used to distinguish those who actually have HF from 
those who do not. Their validity can be assessed by comparing the 
administrative database diagnosis to an accepted 'gold standard' 
reference diagnosis. This diagnosis is typically obtained through 
more resource-intensive processes such as patient self-report, 
retrospective chart review, or prospective clinical examination. 
Principal measures of validity include sensitivity (how many HF 
cases in the population are actually coded for HF) and specificity 
(how many of the non-HF cases in the population are, in turn, not 
coded for HF). Unfortunately, there is some uncertainty 
surrounding the validity of diagnoses recorded in administrative 
databases since most databases are not established for research 
purposes. Validity is of particular concern when studying HF 
patients, as they tend to have high comorbidity burdens and be 
hospitalized for other cardiovascular and respiratory conditions
[13,14]. While HF may have contributed to the need for these 
hospitalizations, this diagnosis may not be entered on the 
discharge record, leaving this potential confounding variable to 
go undetected in subsequent epidemiologic investigations. Although
several assessments of the validity of HF codes in
administrative databases have been published [15-17], there is 
considerable heterogeneity amongst them with regards to the 
clinical settings and reference standards used. Of note, many of 
these assessments were limited to specific populations (e.g. those 
diagnosed with atrial fibrillation [15] or myocardial infarction (MI) 
[16]) so may not be generalizable to the HF diagnoses recorded for 
other individuals. 

As a part of a Canadian Rheumatology Network for establishing 
best practices in the use of administrative data for health research
and surveillance (CANRAD) [18-22], we have conducted a 
systematic review of studies reporting on the validity of diagnostic 
codes for identifying cardiovascular diseases (CVD) in administrative
data. Data from these studies were used to compare the
validity of these codes, and to evaluate whether administrative 
health data can accurately identify CVD for the purpose of 
identifying these events as covariates, outcomes, or complications
in future research. We recently reported our findings on the 
validity of codes for MI [23]. In the current paper, we focus on HF 
and undertake both a qualitative analysis and, for the first time, a
quantitative synthesis of studies reporting on the validity of HF
codes in administrative databases.

Methods 

Literature Search 

Comprehensive searches of the MEDLINE and EMBASE 
databases from inception (1946 and 1974, respectively) to 
November 2010 for all available peer-reviewed literature were
conducted by an experienced librarian (M-DW). Two search
strategies were employed: (1) All studies where administrative data
was used to identify CVD; (2) All studies reporting on the validity 
of administrative data for identifying CVD. Our MEDLINE and 
EMBASE search strategies are available as (Text S1 and S2). To
find additional articles, the authors hand-searched the reference 
lists of the key articles located through the database search. The 
Cited-By tools in PubMed and Google Scholar were also used to 
find relevant articles that had cited the articles located through the 
database search (up to February 2011). The titles and abstracts of
each record were screened for relevance by two independent 
reviewers. No protocol for this systematic review has been 
published, though the review was conducted in accordance with 
the Preferred Reporting Items for Systematic Reviews and Meta- 
Analyses (PRISMA) [24] and Meta-Analysis of Observational 
Studies in Epidemiology (MOOSE) [25] statements; our completed
checklists are provided as (Checklist S1 and S2).

Inclusion Criteria 

We selected full-length, peer-reviewed articles published in 
English that used administrative data and reported validation
statistics for the main International Classification of Diseases (ICD)
codes for HF (ICD-8 and ICD-9 428, and ICD-10 I50), or
provided sufficient data enabling us to calculate them. Any
discrepancies were discussed until consensus was reached. When
the conflict persisted, a third reviewer (JAA-Z) was consulted.

Data Extraction 

The full text of each selected record was examined by two
independent reviewers (NM and VB) who abstracted data using a 
standardized collection form (a copy is provided in Text S3). 
While extracting data, particular attention was given to the study 
population, administrative data source, algorithm used to identify 
HF, validation method, and gold standard. Validation statistics 
comparing the HF codes to definite or possible cases were 
abstracted. These statistics included sensitivity, specificity, positive 
predictive value (PPV), negative predictive value (NPV), and 
kappa scores. Because hospital separations typically contain 
multiple diagnoses, with the primary or principal diagnosis in 
the first position followed by one or more secondary diagnoses, we 
abstracted statistics for each of these positions, where available. 
This was especially important given some recent studies of 
administrative databases that suggest hospitalizations with HF in 
the primary position are decreasing, while those with HF coded in 
secondary diagnostic positions are increasing [26,27]. Data were 
independently abstracted by each reviewer, who subsequently
compared their forms to correct any errors and resolve discrepancies.

Quality Assessment 

The design and methods used by each study, including the 
rigour of the reference standard, can directly influence the validity
statistics produced. Thus, all studies were evaluated for quality, 
and the validation statistics were stratified by level of study quality. 
This was done using the Quality Assessment of Diagnostic 
Accuracy Studies (QUADAS) tool [28] (available as a part of Text
S3), used previously by the CANRAD network in assessing the
validity of codes for diabetes mellitus [21], osteoporosis and
fractures [22], and myocardial infarction [23]. Briefly, it is a 14- 
item evidence-based quality assessment tool used in systematic 
reviews of diagnostic accuracy studies. Each item, phrased as a 
question, addresses one or more aspects of bias or applicability; 
however, there is no overall score. Instead, as done previously 
[22,23], items were independently answered by each reviewer and
used to qualitatively assess each study as High, Medium, or Low 
quality. Any disagreements were resolved by consensus. 

Statistical Analysis 

All validation statistics were abstracted as reported. Where 
sufficient data were available we calculated 95% confidence
intervals (95% CI) and additional validity statistics not directly
reported in the original publication. These were evaluated on 
aggregate, and, as pre-specified, stratified by geographic region 
and time period of publication. In evaluating the HF codes in 
administrative data, we considered the diagnosis assigned during 
the validation process to be the diagnostic gold standard; this 
meant, for instance, that cases coded for HF and classified as HF 
during validation were true-positive cases, while cases coded for 
HF but classified during validation as no-HF were false-positives. 
Sensitivity (the ability of the codes to identify true positive HF 
cases) was equal to the number of true positives divided by the sum 
of true positives and false negatives (all those with HF). Specificity
(the ability of the HF codes to exclude false-positive cases) was 
equal to the number of true negatives divided by the sum of true 
negatives and false positives (all those without HF). Sensitivity and 
specificity were also used to calculate the positive and negative 
likelihood ratios (LR+ and LR-) and diagnostic odds ratios
(DOR). The DOR (the ratio of the odds that coded individuals will
actually have HF compared to the odds that non-coded individuals
will have HF) was equal to the LR+ divided by the LR-.

The positive likelihood ratio (LR+), the ratio of true-positives to
false-positives amongst all those coded for HF, was equal to the
sensitivity divided by (1 - specificity). The negative likelihood ratio
(LR-), the ratio of false-negatives to true-negatives amongst all
those not coded for HF, was equal to (1 - sensitivity) divided by the
specificity. Thus, higher LR+ values (those greater than 1) mean
the presence of an HF code is more indicative of true HF and
lower LR- values (those closer to 0 than 1) mean the absence of an
HF code is more indicative of non-disease (no HF). Specificity
values typically fall close to 1, such that the denominator for LR+
(1 - specificity) is usually much smaller than the denominator for
LR-. As a result, the values for LR+ (which range from 1 to 10 or
more) are usually much larger than those for LR- (which range
from 0 to 1). An LR+ of 5 to 10 means the codes are moderately
good for detecting HF, and an LR+ >10 means the codes are very
good. Similarly, with an LR- of 0.1 to 0.2, the absence of an HF
code corresponds moderately to non-disease (no HF), while an
LR- of <0.1 corresponds very well to non-disease [29].
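The definitions above can be illustrated with a short calculation. The following is a minimal sketch only: the function name and the 2x2 counts are hypothetical and are not drawn from any of the included studies.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Compute sensitivity, specificity, LR+, LR- and DOR from 2x2 counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives against the reference standard.
    LR+ is undefined when specificity is 1, and LR- when specificity is 0;
    this sketch lets Python raise ZeroDivisionError in those cases.
    """
    sensitivity = tp / (tp + fn)              # coded cases among all with HF
    specificity = tn / (tn + fp)              # non-coded among all without HF
    lr_pos = sensitivity / (1 - specificity)  # LR+ = sens / (1 - spec)
    lr_neg = (1 - sensitivity) / specificity  # LR- = (1 - sens) / spec
    dor = lr_pos / lr_neg                     # DOR = LR+ / LR-
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
    }


# Hypothetical validation sample of 1,000 records (illustrative counts only).
example = diagnostic_accuracy(tp=75, fp=30, fn=25, tn=870)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

For a single 2x2 table the DOR also equals (TP x TN) / (FP x FN), which provides a quick check on the arithmetic.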

Three other validation statistics of interest were PPV, NPV, and 
kappa score. The PPV (the likelihood that the HF code 
corresponds to a true-positive case) was equal to the number of 
true positives divided by the total number of cases coded for HF
(true-positives and false-positives). NPV (the likelihood that an 
individual not coded for HF is a true-negative case) was equal to 
the number of true negatives divided by the total number of cases 
not coded for HF (true-negatives and false-negatives). Kappa (a 
measure of the agreement, beyond that expected by chance, 
between how cases are classified in the administrative database 
and by the validation process) was equal to the observed
agreement (the percent of cases classified as either true-positives
or true-negatives) minus that expected by chance, divided by
[100% - the agreement expected by chance]. Kappa scores greater
than 0.60 indicate substantial/almost perfect agreement, 0.41-0.60
is considered as moderate agreement, 0.21-0.40 as fair
agreement, and those 0.20 or lower as slight/poor agreement [30].
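As a worked illustration of these three statistics, the sketch below computes PPV, NPV, and kappa from a single 2x2 table and maps kappa onto the agreement bands quoted above. The function names and counts are hypothetical, and the band cut-offs between the quoted ranges (for example, values between 0.40 and 0.41) are an assumption of this sketch.

```python
def predictive_values_and_kappa(tp, fp, fn, tn):
    """Return (PPV, NPV, kappa) for one 2x2 validation table."""
    n = tp + fp + fn + tn
    ppv = tp / (tp + fp)   # true positives among all coded for HF
    npv = tn / (tn + fn)   # true negatives among all not coded for HF
    observed = (tp + tn) / n
    # Chance agreement: product of the marginal proportions, summed over
    # the "HF" and "no HF" categories.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return ppv, npv, kappa


def kappa_band(kappa):
    """Map a kappa score onto the agreement bands described in the text."""
    if kappa > 0.60:
        return "substantial/almost perfect"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "slight/poor"


# Hypothetical counts, for illustration only.
ppv, npv, kappa = predictive_values_and_kappa(tp=75, fp=30, fn=25, tn=870)
print(f"PPV={ppv:.3f} NPV={npv:.3f} kappa={kappa:.3f} ({kappa_band(kappa)})")
```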

Where available, we abstracted statistics for definite and 
possible cases of HF, though the number of categories reported 
depended on the choice of reference standard. In some studies, the 
reference standard is the presence of any notation of an HF 
diagnosis in the medical chart, and cases are classified simply as 
HF or no HF. The Framingham criteria [31] also classify cases as
either HF or no HF; at least two of the major Framingham criteria 
(which include neck vein distension, cardiomegaly, and acute 
pulmonary edema) or one major criterion and two minor criteria 
(which include ankle oedema, hepatomegaly, and pleural effusion)
must be met for the diagnosis of HF. Other sets of standard criteria 
do allow for further classification. The Carlson criteria [32] use a 
points system in which potential cases are evaluated in three 
categories (history, physical examination, and chest radiography), 
and allocated a maximum of four points in each category, and a 
maximum overall score of 12. A score of 8 or more is considered
Definite HF while 5-7 points are considered Possible HF, and 4 or
fewer points are classified as Unlikely HF [32]. Under the 
European Society of Cardiology criteria [33], for a case to be 
classified as HF there must be both signs and symptoms of HF, 
and objective evidence of cardiac dysfunction. Some investigators 
[17,34] have classified cases meeting both of these criteria as 
Definite HF, and those meeting only one of these criteria as 
Questionable, Possible, or Probable HF. It should be noted that 
while the New York Heart Association functional classification is 
used to measure the degree of functional limitation experienced by 
HF patients, and may assist in the selection of therapies [5], it is 
not used to make the initial diagnosis of HF. 
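The decision rules for two of these reference standards can be written compactly. The sketch below is illustrative only: it encodes the counting rule for the Framingham criteria and the score bands for the Carlson criteria as summarized above, taking 8 or more points as the Definite band (the band directly above the 5-7 Possible range). The function names are hypothetical, and the sketch does not assess the individual clinical findings themselves.

```python
def framingham_hf(n_major, n_minor):
    """HF if at least two major criteria, or one major plus two minor, are met."""
    return n_major >= 2 or (n_major >= 1 and n_minor >= 2)


def carlson_category(score):
    """Classify a Carlson score (0-12 points) into the three bands in the text."""
    if not 0 <= score <= 12:
        raise ValueError("Carlson score must be between 0 and 12")
    if score >= 8:
        return "Definite HF"
    if score >= 5:
        return "Possible HF"
    return "Unlikely HF"


print(framingham_hf(n_major=1, n_minor=2))  # one major + two minor criteria
print(carlson_category(6))
```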

Meta-Analysis 

Studies that reported raw data for sensitivity and specificity were 
included in the meta-analysis. Forest plots and a summary receiver 
operating characteristic (ROC) curve were constructed, and 
pooled estimates (and 95% CIs) of the sensitivity and specificity
values, LR+, LR-, and DORs were calculated. More informative
diagnostic tests (in this case, being HF codes) - those with good
sensitivity and good specificity - will produce ROC curves
positioned in the top-left area of the ROC plane, well away from
the positive diagonal line [35]. Two additional summary measures
of test performance were determined from the ROC curve, the
area under the curve (AUC) and Q*. The AUC ranges between 0
and 1, with 1 corresponding to a perfect test [36]. In the context of 
our research question, an AUC of 1 would mean that, given two 
cases, one with HF and one without, there is a 100% probability 
that the positive case will be coded for HF and the negative case 
will not. The Q*, the lower bound of the AUC, is the point at 
which the sensitivity and specificity are equal [36]. Higher Q* 
values indicate better-performing tests. 

To assess for the presence of heterogeneity amongst the 
included studies, we visually inspected the forest plots and ROC 
curve, and calculated the χ², Cochran's Q [37], and I² [38]
statistics. The I² index, a measure of the degree of inconsistency
across study findings, is expressed as the percentage of variation 
between studies due to heterogeneity as opposed to chance [38]. A 
value of 0% indicates no observed heterogeneity, while 25% is 
indicative of low heterogeneity, 50% moderate, and 75% high 
heterogeneity [38]. When there are a small number of studies, the 
I² index is a preferred measure over Cochran's Q [38]. In the
absence of substantial heterogeneity, a fixed-effects model was to
be applied. Otherwise, a random-effects model was to be applied,
using the DerSimonian-Laird method.
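To make the heterogeneity statistics and the random-effects step concrete, the sketch below computes Cochran's Q, the I² index, the DerSimonian-Laird between-study variance, and the resulting pooled estimate from per-study effects and within-study variances. It is a generic textbook formulation, not the internal implementation of the software used in this review; the effect scale (for example, the log DOR) and the input values are assumptions made for illustration.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird estimator.

    effects   -- per-study effect estimates (e.g. log DOR; assumed scale)
    variances -- corresponding within-study variances
    """
    weights = [1 / v for v in variances]           # inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return {"q": q, "df": df, "i2": i2, "tau2": tau2, "pooled": pooled, "se": se}


# Hypothetical inputs: three studies with equal within-study variance.
result = dersimonian_laird(effects=[1.0, 2.0, 3.0], variances=[0.1, 0.1, 0.1])
print(result)
```

When Q does not exceed its degrees of freedom, the estimated between-study variance is zero and the random-effects estimate coincides with the fixed-effects one.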

To assess the impact each individual study had on the pooled 
estimates, a jackknife sensitivity analysis [39] was performed in 
which one study was removed and all summary statistics were
re-calculated. This process was repeated for all studies. The impact of
publication bias was not evaluated as the common tests available
to assess publication bias, including the Begg, Egger, and Macaskill
tests, have been shown to be misleading for meta-analyses of test
accuracy [40]. All analyses were conducted using Meta-DiSc
software, version 1.4 [41]. 
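The leave-one-out logic of the jackknife analysis can be sketched as follows. For brevity, this illustration pools sensitivity and specificity by summing the 2x2 counts across studies, which is a simplifying assumption and not the random-effects model described above; the study names and counts are hypothetical.

```python
def pooled_sens_spec(studies):
    """Pool sensitivity and specificity by summing counts (simplified)."""
    tp = sum(s["tp"] for s in studies)
    fn = sum(s["fn"] for s in studies)
    tn = sum(s["tn"] for s in studies)
    fp = sum(s["fp"] for s in studies)
    return tp / (tp + fn), tn / (tn + fp)


def jackknife(studies):
    """Re-calculate the pooled estimates with each study removed in turn."""
    results = {}
    for i, study in enumerate(studies):
        remaining = studies[:i] + studies[i + 1:]
        results[study["name"]] = pooled_sens_spec(remaining)
    return results


# Hypothetical studies, for illustration only.
studies = [
    {"name": "A", "tp": 80, "fn": 20, "tn": 900, "fp": 100},
    {"name": "B", "tp": 60, "fn": 40, "tn": 950, "fp": 50},
    {"name": "C", "tp": 70, "fn": 30, "tn": 980, "fp": 20},
]
for name, (sens, spec) in jackknife(studies).items():
    print(f"without {name}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A study whose removal shifts the pooled values noticeably is an influential one.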

Results 

Literature Search 

After the removal of duplicates, 1,587 citations were identified 
through MEDLINE and EMBASE searches and screened for 
relevance to our study objectives. We then assessed 98 full-text 
articles for eligibility (Figure 1), of which 12 were selected for 
inclusion. We also assessed 30 full-text articles for eligibility that 
were identified from hand searches, and selected 7 additional 
articles therein. Thus, a total of 128 articles were assessed for 
eligibility, from which 109 were excluded, mainly because they 
reported on the validity of other CVD (n = 59), or did not actually 
validate HF diagnoses in administrative data (n = 20). Six articles
were excluded because they were not published in English; their 
languages of publication were Danish, German, Italian, Japanese, 
Portuguese, and Spanish. Ultimately 19 articles were included for 
the qualitative systematic review of HF. 

Study Characteristics 

Of the 19 articles evaluating HF diagnoses that were included in
the final review, nine (47%) were from the United States, six (32%) 
were from Canada, three (16%) from Europe, and one was from 
Australia. Characteristics of these studies are presented in
Table 1. Validation was the primary objective in all but one 
[42] of these studies. Six studies [17,34,42-45] reported on the 
validity of HF exclusively, while 13 reported on the validity of 
other diagnoses as well. Overall, data were collected over a thirty-year
period (1976-2005), though the studies were all published
relatively recently (the earliest in 1999 [16]). PPV data were
available from all but one [16] study, sensitivity and NPV data
were available from 14 studies (74%), and specificity data from 13
studies (68%). Kappa scores were less frequently reported. Only
four studies [17,46-48] reported on the validity of ICD-10 codes
separately from ICD-9. Most of the administrative databases 
pertained to hospitalizations though diagnoses recorded for 
outpatient encounters were included in five studies 
[42,44,45,49,50]. None of the studies reported on the validity of 
HF as a cause-of-death. 

Chart reviews, sometimes in conjunction with unspecified 
diagnostic criteria, formed the basis of the gold standard in nine 
(47%) studies [15,16,43,46-48,50-52], and patient self-report was 
used in one [49]. Cardiac disease registries were used in two 
studies [53,54], while a specific set of diagnostic criteria were 
incorporated in the reference standards of the seven remaining 
studies [17,34,42,44,45,55,56]. 

Study quality was evaluated based on the QUADAS tool [28],
with 11 of 19 studies (58%) categorized as high quality, seven as 
medium (37%), and just one (5%) as low quality. A detailed 
breakdown of the quality assessment for each study is provided in 
Table S1. Of the seven medium-quality studies, two did not
adequately describe the validation process [47,48], and five used a
less-reliable gold standard than published diagnostic criteria (being
patient self-report in one [49] and chart review by an individual
other than a clinician or trained hospital coder in four
[15,16,50,51]). The low-quality study [52] employed a very select
source population (women at one institution undergoing vaginal
hysterectomy), assessed only two potential cases of HF, and did not
adequately describe the validation process.

Validity of Heart Failure Diagnoses 

The validation statistics reported by each of the included studies 
are provided in Table 2. Sensitivity was reported by 14 studies, and
was ≥69% in half of them (range: 0 to 87%). PPV was undefined
(0/0) in one of the studies [52], but was at least 87% in nine of the 17
remaining studies (range: 34 to 100%). Specificity was ≥95% in all
13 studies reporting this statistic, and NPV was ≥88% in all but two
of the 14 studies where this data was available. Kappa was only
reported in six (32%) studies [43,46,47,51,53,55]. The values in 
three of the studies (which ranged from 0.43 to 0.58) indicated there 
was moderate agreement between the diagnostic codes and 
reference standard, while those in the other three (range 0.72 to 
0.94) indicated there was substantial to almost perfect agreement. 

The Framingham criteria were used in three studies, with the 
PPVs reported as 65% [45], 82% [42], and 94% [56]. One of
these studies [56] used both the Framingham and Carlson criteria, 
and found higher accuracy with the Framingham (PPV = 94%) 
than with the Carlson (PPV = 89%). The European Society of 
Cardiology criteria were used in two studies; the PPV for definite 
HF was 82% in one [34] but just 65% in the other [17]. Sex- 
stratified statistics were provided by two studies; one [56] reported 
a slightly higher PPV for ICD-9 428 in females, and the other [43]
found that the sensitivity of ICD-9 428 was significantly better in
females than males (66% vs. 61%). In that study, the sensitivity was
also significantly better amongst Mexican Americans than Non-
Hispanic whites (66% vs. 59%) [43]. 

The studies in Table 2 are ordered chronologically by 
publication year for the purpose of identifying any secular trends 
in the validity of HF codes. The nine earliest studies included in this
review were published between 1999 and 2004, with the ten
remaining studies published from 2005 to 2009. However, no
secular trends were observed for any of the validation statistics.
Eleven (58%) of the studies included in this review were rated as
high quality and seven (37%) as medium quality. Sensitivity ranged
from 43% to 87% amongst the high quality studies, and from 58%
to 87% amongst the medium quality. The PPVs for these two
categories were also similar, ranging from 36% to 99% amongst the
high quality studies, and from 34% to 100% amongst the medium
quality. Any geographic comparisons were limited by the fact that 
15 of the 19 studies were conducted in North America. The only 
difference observed was that the sensitivity values tended to be 
higher amongst the seven US studies than the five Canadian ones. 

Meta-Analysis

Included in the quantitative synthesis were the 11 articles from
which raw data on sensitivity and specificity were available. Forest
plots of the pooled sensitivity and specificity values are illustrated
in Figure 2. A random-effects model was used since the χ², Q,
and I² statistics indicated there was a high level of heterogeneity
between studies. The pooled sensitivity was 75.3% (95% CI: 74.7-75.9)
and the pooled specificity was 96.8% (95% CI: 96.8-96.9).
The summary LR+ was 51.9 (95% CI: 20.5-131.6) and the
summary LR- was 0.27 (95% CI: 0.20-0.37), giving a summary
DOR of 186.5 (95% CI: 96.8-359.22). The summary ROC curve
is illustrated in Figure 3, wherein the AUC was 0.93 (SE 0.0396)
and the Q* was 0.86 (SE 0.0466).
[Figure 1 flowchart: 19 studies were included in the qualitative systematic review of HF diagnoses (12 identified through database searching, 7 identified through other sources); 11 studies were included in the quantitative review.]

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-style Flowchart of Study Selection and
Review. HF = heart failure; ICD = International Classification of Diseases.
doi:10.1371/journal.pone.0104519.g001



Results of the jackknife sensitivity analysis, wherein the pooled
estimates were re-calculated after the removal of one study at a
time, are shown in Table 3. The resulting pooled sensitivity
estimates ranged mainly from 74% to 76%, and the specificities
from 96.6% to 96.9%. The two most influential studies were those
by Onofrei et al (sensitivity = 77.5% and specificity = 93.3% after
its removal) and Austin et al (sensitivity = 71.8% and specificity = 99.1%
after its removal).
Discussion

To our knowledge this is the first systematic review and meta-
analysis on the validity of HF diagnoses in administrative data. 
Findings from this review suggest that the sensitivity of these codes 
is suboptimal, as sensitivity was ^69% in 8 of the 14 studies 
reporting this statistic. However, the specificity and PPV of these 
codes appears much better: specificity was at least 95% in the 13 
studies where this statistic was reported, and, in the majority of 
studies, the PPV was at least 87%. Further support was provided 
by the results of the meta-analysis, as the pooled specificity of HF 
codes was 97%, and the pooled LR+ was 52. This means an 
individual coded for HF is fifty-two-times more likely to actually 
have HF than someone not coded. However, the pooled sensitivity 
was modest, at just 75%, and the summary LR- value of 0.27 
suggests that the absence of an HF code can rule out the diagnosis 
of HF only moderately. 

The PPVs and NPVs amongst the studies included in this
review were generally good, being at least 87% in the majority of 
studies reporting these statistics. A recently-published qualitative 
review of the validity of HF codes in North American databases 
also found the PPV to be generally high (>90% in most) [57]. 
However, it must be kept in mind that PPV and NPV are both 
dependent on the prevalence of the condition in the study 
population [35], and will be lower for rare conditions than for 
common conditions. This is important for HF because this 
condition differentially affects older individuals: for example, HF is
reported to affect approximately 7.8% of US males aged 60-79 
years, but only 1.5% of US males aged 40-59 years [2]. A higher 
baseline risk of HF in the study population may explain why 
several studies included in this review reported exceptionally high 
PPVs. For example, in the study by Szeto et al [50], which was
conducted amongst a cohort of patients attending a Veterans
Affairs clinic, the prevalence of HF was 10%, and the PPV was
100%. So et al [48] examined the charts of patients hospitalized
for MI, amongst whom the prevalence of HF was 29%, and the
PPV in that study was 94%. Similarly, the prevalence of HF was
47% amongst the atrial fibrillation cohort studied by Birman-
Deych et al [15], and the PPV was 97%. Consequently, if the
exclusion of false-positive HF cases is of utmost priority for a
particular study, the age and disease history of the study 
population must be taken into account when evaluating how 
accurately these codes will identify true HF cases. 
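The dependence of PPV on prevalence follows directly from Bayes' theorem and can be illustrated numerically. The sketch below applies the pooled sensitivity (75.3%) and specificity (96.8%) from this review to the two US prevalence figures quoted above; combining these numbers is purely illustrative, since the sensitivity and specificity in any particular population may differ from the pooled values. The function name is hypothetical.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Expected PPV for a given sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence               # coded and truly HF
    false_pos = (1 - specificity) * (1 - prevalence)  # coded but no HF
    return true_pos / (true_pos + false_pos)


# Pooled estimates from this review, applied to two prevalence figures [2].
for label, prevalence in [("males 40-59 years", 0.015),
                          ("males 60-79 years", 0.078)]:
    ppv = ppv_from_prevalence(0.753, 0.968, prevalence)
    print(f"{label}: prevalence={prevalence:.1%}, expected PPV={ppv:.1%}")
```

With sensitivity and specificity held fixed, the expected PPV rises as the condition becomes more common in the population studied.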

Findings from our review suggest that administrative data codes 
are less-than-optimal for capturing HF cases, and this is consistent 
with another qualitative review of the validity of HF codes in 
which the sensitivity of HF diagnoses was highly-variable [58]. 
Instead of HF itself, some authors have suggested there is a 
tendency to list the underlying cause of the HF (such as MI or 
atrial fibrillation [17]), or another cardiac condition [53], in the 
primary position of the hospital discharge summary. Moreover, 
hospital coders generally report active conditions [51] such as MI,
but may leave out chronic conditions such as HF if they were 
deemed not to have impacted the treatments provided in hospital 
or length-of-stay [47,58] . Similarly, the study by Birman-Deych et 
al, where sensitivity for HF increased with disease severity (from 
80% for mild cases to 94% for severe [15]), suggested that severe 
cases of HF may be recorded more often in administrative 
databases than mild ones. Thus, to maximize the capture of HF 
cases, authors are advised to broaden their search parameters by 
examining all diagnostic positions of the hospital record, 
considering the inclusion of more HF-related codes (other than 
ICD-9 428 or ICD-10 I50) in the search algorithm, and, where 
available, searching for HF cases in both hospitalization and 
outpatient databases. 

Figure 2. Forest Plots of Sensitivities and Specificities of Heart Failure Codes as Reported by Included Studies. 95% CI = 95% confidence interval; DF = degrees of freedom. 
doi:10.1371/journal.pone.0104519.g002 
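
The broadened search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a validated algorithm: the record layout (a `patient_id`, a `source`, and an ordered `diagnoses` list whose first entry is the primary diagnosis) and all function names are hypothetical, and only the two code families named in the text (ICD-9 428 and ICD-10 I50) are matched.

```python
# Code prefixes named in the text; any broader code list would be study-specific.
HF_PREFIXES = ("428", "I50")


def is_hf_code(code: str) -> bool:
    """True if a diagnosis code falls under ICD-9 428 or ICD-10 I50."""
    return code.strip().upper().replace(".", "").startswith(HF_PREFIXES)


def has_hf(record: dict, primary_only: bool = False) -> bool:
    """Check a record's diagnosis list (position 0 = primary diagnosis)."""
    codes = record["diagnoses"][:1] if primary_only else record["diagnoses"]
    return any(is_hf_code(c) for c in codes)


def hf_patients(records: list[dict], primary_only: bool = False) -> set[str]:
    """Patient IDs with an HF code in any supplied inpatient or outpatient record."""
    return {r["patient_id"] for r in records if has_hf(r, primary_only)}


# Hypothetical records: HF in a secondary position, and HF seen only in outpatient data.
records = [
    {"patient_id": "A", "source": "inpatient", "diagnoses": ["410.9", "428.0"]},
    {"patient_id": "B", "source": "outpatient", "diagnoses": ["I50.9"]},
    {"patient_id": "C", "source": "inpatient", "diagnoses": ["427.31"]},
]

print(sorted(hf_patients(records)))                     # all positions, both sources
print(sorted(hf_patients(records, primary_only=True)))  # primary position only
```

In this toy data, patient A is found only when secondary positions are searched, which mirrors the loss of sensitivity expected from primary-position searches.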

Sources of Administrative Data 

While the improvements were not substantial, some studies we 
reviewed suggested HF cases could be identified more accurately if 
algorithms combining hospital codes with prescription data were 
applied. For example, Rector et al [49] tested several algorithms to 
identify HF, some of which incorporated prescription claims for an 
angiotensin converting enzyme (ACE) inhibitor, angiotensin-II 
receptor antagonist, loop diuretic, or digoxin. When using an 
algorithm that required a healthcare encounter and HF-related 
prescription, the specificity was high (92%), though the sensitivity 
was modest (53%) [49]. While further research is needed in this 
area, findings from that study also suggested that HF cases could 
be identified from prescription data alone, as the specificity of an 
algorithm that included a single HF-related prescription, but no 
healthcare encounters, was 78% [49]. This occurred despite the 
fact that many medications used in the treatment of HF are also 
used to treat other conditions [49]. Hence, prescription medication 
data could be used to validate HF cases first identified from 
hospital or outpatient data, or used alone to identify HF cases in a 
sensitivity analysis. Requiring that cases be dispensed a combination 
of medications (for example, each of a diuretic, an ACE 
inhibitor or angiotensin-II receptor antagonist, and a beta blocker) 
may improve specificity. At this time the prescription databases in 
many countries only include government-subsidized prescriptions, 
limiting the potential of this data source for identifying cases. 
Thus, prescription medication data should only be used if the 
database contains records on all community-dispensed prescriptions, 
regardless of payer, or at least all prescriptions dispensed to 
senior citizens. 

Figure 3. Receiver Operating Characteristic Curve Describing the Diagnostic Performance of Heart Failure Codes in Administrative Databases. AUC = Area Under the Curve; SROC = Summary Receiver Operating Characteristic; SE = standard error. 
doi:10.1371/journal.pone.0104519.g003 
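
The prescription-based rules discussed above can be expressed as simple set operations. The sketch below is illustrative only: the drug-class labels (for example `loop_diuretic` and `arb`) and the function names are hypothetical, the inclusion of thiazides among the diuretics is our assumption, and a real study would first need to map dispensed drug codes to these classes.

```python
# Hypothetical drug-class labels assigned to each patient's dispensing history.
DIURETICS = {"loop_diuretic", "thiazide_diuretic"}  # thiazides are an assumption
RAS_BLOCKERS = {"ace_inhibitor", "arb"}  # arb = angiotensin-II receptor antagonist
RECTOR_CLASSES = {"ace_inhibitor", "arb", "loop_diuretic", "digoxin"}


def any_hf_drug(classes: set[str]) -> bool:
    """At least one of the four HF-related classes tested by Rector et al."""
    return bool(classes & RECTOR_CLASSES)


def encounter_and_drug(has_hf_encounter: bool, classes: set[str]) -> bool:
    """Rule requiring both an HF-coded encounter and an HF-related prescription."""
    return has_hf_encounter and any_hf_drug(classes)


def combination_rule(classes: set[str]) -> bool:
    """Stricter rule: a diuretic AND (ACE inhibitor OR ARB) AND a beta blocker."""
    return (
        bool(classes & DIURETICS)
        and bool(classes & RAS_BLOCKERS)
        and "beta_blocker" in classes
    )


print(any_hf_drug({"ace_inhibitor"}))       # a single drug satisfies the loose rule
print(combination_rule({"ace_inhibitor"}))  # but not the combination rule
print(combination_rule({"loop_diuretic", "arb", "beta_blocker"}))
```

The combination rule trades sensitivity for specificity: a patient dispensed only an ACE inhibitor, which is also used for hypertension, is no longer counted.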

Laboratory databases may also be a useful source for identifying 
HF. Specifically, levels of B-type natriuretic peptide (BNP) are 
often elevated in patients with left ventricular HF [6], so 
individuals with high BNP values could be identified as HF cases. 
One study in this review, by Alqaisi et al [45], compared the 
accuracy of different algorithms for identifying HF, some of which 
included BNP levels, and the highest-sensitivity algorithm in that 
study was ≥2 outpatient encounters for HF, or ≥1 hospitalization 
for HF, or a BNP level of ≥200 pg/ml. That algorithm achieved a 
sensitivity of 76% and a specificity of 75%. BNP levels can be 
elevated in conditions other than HF, such as pulmonary 
embolism and chronic obstructive pulmonary disease [6], which 
may limit the specificity of BNP levels for identifying HF. 
However, additional findings from the Alqaisi et al study [45], 
where the specificities of BNP levels of ≥100 pg/ml, 200 pg/ml, 
and 500 pg/ml (without considering any diagnoses from 
healthcare encounters) were 76%, 88%, and 95%, respectively, 
suggest this test is reasonably specific for HF. Any potential 
increases in sensitivity will be limited by the fact that BNP levels 
tend to be elevated more in HF patients with left systolic 
dysfunction than diastolic dysfunction [6]. Another caveat is that 
BNP is a less sensitive test in non-acute HF [5]. Where laboratory 
data are available, we suggest researchers incorporate BNP levels 
into their case definition and, in a sensitivity analysis, compare the 
HF cases identified with and without BNP levels. 
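
A case definition of the kind tested by Alqaisi et al, together with the suggested with-and-without comparison, might look like the sketch below. The thresholds (at least 2 outpatient encounters, at least 1 hospitalization, BNP of at least 200 pg/ml) follow our reading of the algorithm described above; the function names, the patient data, and the data layout are hypothetical.

```python
def code_based_case(outpatient_hf_visits: int, hf_hospitalizations: int) -> bool:
    """Case definition from diagnostic codes alone."""
    return outpatient_hf_visits >= 2 or hf_hospitalizations >= 1


def code_or_bnp_case(outpatient_hf_visits: int, hf_hospitalizations: int,
                     bnp_pg_ml: list[float], cutoff: float = 200.0) -> bool:
    """Same definition, additionally accepting any BNP result at or above the cutoff."""
    return (code_based_case(outpatient_hf_visits, hf_hospitalizations)
            or any(value >= cutoff for value in bnp_pg_ml))


# Hypothetical patients: (outpatient HF visits, HF hospitalizations, BNP results in pg/ml)
patients = {
    "P1": (2, 0, []),
    "P2": (0, 0, [350.0]),
    "P3": (1, 0, [90.0]),
}

with_bnp = {pid for pid, (op, ip, bnp) in patients.items() if code_or_bnp_case(op, ip, bnp)}
without_bnp = {pid for pid, (op, ip, _) in patients.items() if code_based_case(op, ip)}
added_by_bnp = with_bnp - without_bnp  # cases found only through the laboratory data

print(sorted(without_bnp), sorted(with_bnp), sorted(added_by_bnp))
```

Raising the `cutoff` argument (for example to 500 pg/ml) would be expected to improve specificity at the cost of sensitivity, in line with the specificities reported above.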

Reference Standards 

Findings from our meta-analysis suggested a high degree of 
heterogeneity amongst the included studies; thus, a random-effects 
model was used to produce the summary measures. Part of this 
heterogeneity can be attributed to differences in the characteristics 
of the study populations. Some studies were community-based or 
conducted on a general hospitalized population while others were 
conducted on select populations (e.g. elderly people or those with a 
history of MI) in whom HF is more prevalent. Variations in the 
size of the study population may also have contributed, as there 
were changes, though not substantial, in the pooled sensitivity and 
specificity estimates after each of the two largest studies 
(n = 205,755 and n = 58,816) were removed. 

Table 3. Results of Sensitivity Analysis Using a Jackknife Approach. 

Omitted Study (First Author, Year)  Records Evaluated (N)  Pooled Sensitivity (95% CI)  Pooled Specificity (95% CI)  Pooled LR+ (95% CI)  Pooled LR- (95% CI)  Pooled Diagnostic OR (95% CI)
All studies included                371,055                75.3 (74.7-75.9)             96.8 (96.8-96.9)             51.9 (20.5-131.6)    0.27 (0.20-0.37)     186.5 (96.8-359.22)
Austin [53], 2002                   58,816                 71.8 (71.1-72.5)             99.1 (99.1-99.1)             62.1 (35.2-109.6)    0.28 (0.21-0.38)     220.2 (119.1-407.3)
Birman-Deych [15], 2005             23,657                 74.6 (73.7-75.4)             96.8 (98.7-96.9)             53.4 (21.2-134.7)    0.27 (0.18-0.40)     193.3 (90.0-415.2)
Chen [47], 2009                     4,008                  75.4 (74.8-76.0)             96.8 (96.8-96.9)             48.8 (18.6-128.1)    0.26 (0.19-0.37)     176.9 (89.4-350.0)
Goff [43], 2000                     5,083                  76.2 (75.6-76.8)             [?] ([?]-96.9)               59.2 (21.5-162.9)    [?] ([?]-0.37)       221.9 (108.3-454.7)
Heckbert [55], 2004                 34,016                 75.2 (74.6-75.8)             96.8 (96.7-96.8)             54.5 (18.3-162.3)    0.28 (0.20-0.38)     189.9 (93.8-384.6)
Henderson [46], 2006: 1998-1999     7,004                  75.1 (74.5-75.7)             96.8 (96.7-96.8)             46.7 (17.9-121.8)    0.29 (0.21-0.40)     154.4 (80.5-296.0)
Henderson [46], 2006: 2000-2001     7,631                  75.2 (74.6-75.8)             96.8 (96.7-96.8)             44.1 (16.9-115.2)    0.29 (0.21-0.39)     147.8 (76.6-285.3)
Merry [54], 2009                    21,110                 75.6 (75.0-76.1)             96.6 (96.6-96.7)             41.8 (16.0-109.2)    0.25 (0.18-0.34)     159.7 (81.8-311.6)
Onofrei [44], 2004                  205,756                77.5 (76.9-78.1)             93.2 (93.0-93.3)             49.8 (19.0-130.0)    0.25 (0.20-0.32)     193.56 (92.4-405.3)
Rector [49], 2004                   3,633                  75.5 (74.9-76.1)             96.9 (96.8-96.9)             62.2 (22.7-170.8)    0.26 (0.18-0.35)     235.5 (118.8-467.1)
So [48], 2006                       193                    75.3 (74.7-75.9)             96.8 (96.8-96.9)             55.9 (21.2-147.6)    0.28 (0.20-0.38)     193.1 (97.6-381.9)
Szeto [50], 2002                    148                    75.3 (74.7-75.9)             96.8 (96.8-96.9)             48.1 (18.6-124.9)    0.28 (0.20-0.38)     174.9 (89.9-340.3)

95% CI = 95% confidence interval; LR+ = positive likelihood ratio; LR- = negative likelihood ratio; OR = odds ratio; [?] = value illegible in the source. 
doi:10.1371/journal.pone.0104519.t003 
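
The jackknife (leave-one-out) procedure summarized in Table 3 can be sketched in a few lines. The example below is a simplified illustration and makes two assumptions that should be stated plainly: the counts are invented rather than taken from the included studies, and sensitivity is pooled by simply summing true positives and false negatives across studies, which is not the random-effects model used for the likelihood ratios and diagnostic odds ratios.

```python
def pooled_sensitivity(studies: list[dict]) -> float:
    """Pool sensitivity by summing true positives and false negatives across studies."""
    tp = sum(s["tp"] for s in studies)
    fn = sum(s["fn"] for s in studies)
    return tp / (tp + fn)


def jackknife(studies: list[dict]) -> dict[str, float]:
    """Map each omitted study's label to the estimate pooled over the remaining studies."""
    return {
        omitted["label"]: pooled_sensitivity([s for s in studies if s is not omitted])
        for omitted in studies
    }


# Hypothetical counts, NOT data from the included studies.
studies = [
    {"label": "Study A", "tp": 80, "fn": 20},
    {"label": "Study B", "tp": 600, "fn": 400},
    {"label": "Study C", "tp": 45, "fn": 5},
]

overall = pooled_sensitivity(studies)
for label, estimate in jackknife(studies).items():
    print(f"omit {label}: {estimate:.3f} (shift {estimate - overall:+.3f})")
```

As in Table 3, the largest study dominates the pooled estimate, so omitting it produces the largest shift.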

More importantly, there was much heterogeneity in the 
reference standards used by different studies. This was not 
surprising as there is no single accepted gold standard for the 
diagnosis of HF, and a definitive diagnosis of HF is often difficult 
[53,58], especially in elderly patients with multiple complications 
[34]. The reference standards used included patient self-report, 
chart reviews by clinicians and non-clinicians, two distinct disease 
registries, and the application of several sets of standard diagnostic 
criteria including the Framingham, Carlson, and European 
Society of Cardiology (ESC) criteria. One study included in our review, 
by Onofrei et al [44], reported both low sensitivity (44%) and low 
PPV (36%) for HF codes, which could be explained, in part, by 
their choice of reference standard. It consisted of a single 
measurement, left ventricular ejection fraction (LVEF) of either 
<55% or <40%. The LVEF is not typically used for HF 
diagnosis, but instead for classifying HF patients with left 
ventricular systolic or diastolic dysfunction [59]. Although the 
thresholds vary, an LVEF below 40% or 50% is usually indicative 
of systolic dysfunction, while higher LVEF values in HF patients 
are usually indicative of diastolic dysfunction but preserved systolic 
function [59]. It is possible that some of the false-positive cases in 
that study (coded for HF but whose LVEF measurement did not 
fall below the thresholds) exhibited other signs and symptoms that 
would fulfill the criteria for HF under the less-restrictive 
Framingham, Carlson, or ESC definitions. Thus, this choice of 
reference standard may have attenuated the PPV. 

Furthermore, although the Framingham and Carlson criteria 
have been shown to be 100% sensitive to cases of definite HF [60], 
especially severe cases [61], the Framingham criteria are 
considered by some to be insensitive for detecting early HF 
[62,63]. Therefore, the application of standard diagnostic criteria 
may also attenuate the PPV. In this review, we did observe a trend 
towards greater PPV (80-100%) when simply a physician's written 
confirmation of HF diagnosis or other notation in the medical 
chart was used as a gold standard [15,42,50]. In fact, Roger et al 
[42] compared two gold standards, physician diagnosis (as written 
in the chart) and the Framingham criteria, and found that the PPV 
from physician diagnosis was higher compared with the Framing- 
ham criteria (90% vs. 82%). Although the physician diagnosis may 
be more subjective, it may better reflect the diagnoses made in 
day-to-day clinical practice and thus be more meaningful to health 
researchers. 

In addition to prescription medication and laboratory data, a 
third resource that could be used in conjunction with conventional 
administrative (billing) data to capture more HF cases is the 
electronic medical record (EMR), or electronic health record 
(EHR). The EMR or EHR is a digital file used by healthcare 
providers for patient care [64]. Though some authors use EMR 
when referring to the digital file maintained by a single 
practitioner, and EHR when referring to a digital file containing 
inpatient and outpatient data from multiple practitioners, for 
simplicity we will employ a single term, EMR, in this discussion. 
The materials available in the EMR can vary, but generally 
include clinical notes (similar to those recorded in a paper medical 
chart), prescription records, and laboratory and radiology reports 
[64]. With access to EMRs, researchers can identify HF cases by 
searching for an ICD code for HF amongst the patient's Problem 
List, a list maintained by the practitioner of all current and active 
diagnoses. In addition, researchers can also search for the term 
'heart failure' amongst the entire free-text areas of the clinical 
notes, laboratory and radiological reports, and any correspon- 
dence from specialists. For example, the Mayo Clinic has used a 
natural language processing (NLP) algorithm containing the terms 
'cardiomyopathy', 'heart failure', 'congestive heart failure', 'pul- 
monary edema', 'decompensated heart failure', 'volume overload' 
and 'fluid overload', along with 426 synonyms for these terms [65]. 
A potential case is eliminated if a negative term (such as 'no' or 
'unlikely'), or sometimes even a speculative term (such as 'rule-out' 
or 'suspected') is found within close proximity of the HF term 
[65,66]. For chronic conditions like HF that may be superseded by 
other diagnoses on reimbursement claims, this EMR-based search 
strategy may be more sensitive than relying on the diagnostic 
codes in claims-based administrative databases, and may aid in 
capturing milder cases. 
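
The free-text search with negation handling described above can be approximated with regular expressions. The sketch below is a toy illustration, not the Mayo Clinic algorithm: it uses only the seven HF terms and four cue words quoted in the text (none of the 426 synonyms), and the five-word proximity window and the function name `mentions_hf` are assumptions chosen for the example. It also ignores sentence boundaries, which a production NLP system would need to respect.

```python
import re

# The seven HF terms quoted in the text; the optional prefix covers the
# 'congestive' and 'decompensated' variants of 'heart failure'.
HF_PATTERN = re.compile(
    r"\b(cardiomyopathy|(?:congestive |decompensated )?heart failure|"
    r"pulmonary edema|volume overload|fluid overload)\b",
    re.IGNORECASE,
)
# Negative ('no', 'unlikely') and speculative ('rule-out', 'suspected') cues.
CUE_PATTERN = re.compile(r"\b(no|unlikely|rule-out|suspected)\b", re.IGNORECASE)


def mentions_hf(note: str, window: int = 5) -> bool:
    """True if an HF term appears with no cue word within `window` words either side."""
    for match in HF_PATTERN.finditer(note):
        before = note[:match.start()].split()[-window:]
        after = note[match.end():].split()[:window]
        if not CUE_PATTERN.search(" ".join(before + after)):
            return True
    return False


print(mentions_hf("Admitted with decompensated heart failure."))
print(mentions_hf("No evidence of heart failure on exam."))
print(mentions_hf("Suspected pulmonary edema; echo pending."))
```

A fixed word window is deliberately crude: a cue in a neighbouring sentence will suppress a genuine mention, so the window size would need tuning against a validated sample of notes.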

However, the EMR also has some limitations, many of which 
stem from the fact that, similar to most administrative databases, 
the EMR was not established for research purposes [64]. For 
example, when searching the free-text notes and reports, computer 
programs may have difficulty processing whether ambiguous 
phrases like "cannot be ruled out" [67] correspond to a positive 
case. In addition, while the information contained in administra- 
tive databases is already de-identified, the EMR does contain 
personally-identifying information. US law stipulates that, unless 
each patient provides consent [68], researchers cannot use data 
collected from the EMR without it undergoing a de-identification 
process [69]. Thus, it may be costly and time-consuming for 
researchers to access this data, especially as many hospitals do not 
use de-identification tools at present [68]. A lack of standardization 
across EMR systems [64] and challenges in linking EMRs 
from different hospitals or provider networks [70] may also limit 
the use of EMR. Finally, just as with ICD codes, the validity of the 
HF cases identified from the free-text areas of the EMR must be 
assessed before EMRs can be used for HF research. While the 
results of some validation studies [65,67] have been promising, 
more validation studies conducted in different settings, and using 
different EMR platforms, are needed to confirm these findings. 

Limitations 

We acknowledge some limitations to our systematic review. 
There is the potential for a language bias as we could not consider 
articles whose full-texts were not available in English; articles 
published after the conclusion of our search period (February 
2011) could not be considered either. Another potential limitation 
stems from the fact that, even though our database searches were 
conducted by an experienced librarian, administrative databases 
are not well catalogued in MEDLINE and EMBASE (e.g. no 
MeSH term pertaining to "administrative database"). Although 
most of the included studies were located through database 
searches, our subsequent hand search turned up several more 
relevant articles, most of which had not been indexed under terms 
relating to Administrative Data or Validation. As a result, despite 
our extensive hand search, we may have missed some relevant 
articles if they were not indexed in MEDLINE or EMBASE under 
a term relating to administrative data or validation. Our findings 
are also subject to publication bias, wherein reports of HF codes 
having poor validity may have been differentially withheld from 
publication. However, given the number of reports we located 
where the sensitivity of HF codes was suboptimal, we feel this is 
unlikely. 

Recommendations 

After qualitative and quantitative analysis of the evidence, we 
conclude that the HF codes that do appear in administrative 
databases are highly predictive of true HF cases. At the same time, 
administrative databases fail to capture a non-negligible number of 
true cases, perhaps 25% to 30% of all diagnoses, and may 
differentially capture the most severe cases. Based on current 
evidence, we recommend several strategies for increasing the 
capture of HF cases in administrative data: 

• Hospitalizations with HF in the primary position are 
decreasing, while those with HF in secondary positions are 
increasing [26,27]. Thus, researchers should search amongst 
all available diagnostic positions in hospitalization data for HF 
codes. 



• With many HF patients treated exclusively on an outpatient 
basis, data from both inpatient and outpatient encounters 
should be searched. 

• Where available, researchers should supplement their data 
with searches of laboratory databases (specifically BNP values) 
and/or prescription medication data. 

• Searching the free-text areas of the EMR for mentions of 
'heart failure' and related terms, as well as the Problem List, 
should help identify cases whose HF diagnosis has not been 
recorded on an inpatient or outpatient billing record. This may 
particularly aid in the capture of mild HF. 

Conclusions 

The chronic and syndromic nature of HF creates difficulties for 
researchers studying this condition at the population level. To 
guide their efforts, we conducted a systematic review and meta- 
analysis of articles reporting on the validity of HF diagnoses in 
administrative data. Our findings suggest that, although the HF 
diagnoses identified using administrative data frequently corre- 
spond to true HF cases, this data source may not capture all cases. 
Administrative databases are increasingly being used to study long- 
term patient outcomes and disease burden; thus, to maximize the 
sensitivity of these data sources for all conditions, physicians and 
hospital coders are encouraged to record diagnoses of all 
comorbidities that may have contributed to a given healthcare 
encounter. In the meantime, the use of broader case definitions, 
potentially in combination with prescription medication and 
laboratory data, and searches of electronic medical records, may 
increase the sensitivity of this data source for HF, and in turn, its 
application in population-based health outcomes and economics 
research. 